Abstract

A double occurrence word w over a finite alphabet $\Sigma$ is a word in
which each alphabet letter appears exactly twice. Such words arise 
naturally in the study of topology, graph theory, and combinatorics. 
Recently, double occurrence words have been used for studying DNA 
recombination events. We develop formulas for counting and enu- 
merating several elementary classes of double occurrence words such 
as palindromic, irreducible, and strongly-irreducible words. 

1 Introduction 

A double occurrence word w of size n is a word containing n distinct letters 
in any order which appear exactly twice, i.e., the length of w is 2n. There 
are three common pictorial representations of double occurrence words: 
self-intersecting closed curves in $\mathbb{R}^3$, chord diagrams, and linked diagrams
as depicted in Figure 1.

Topologically, a double occurrence word with n distinct letters can be 
interpreted as a closed curve traversing n fixed points in $\mathbb{R}^3$ twice. Such
a curve (also called an assembly graph 0) is self-intersecting and may 
contain over and under crossings when projected into the plane. Each 
curve of this type can be characterized through the double occurrence word 
corresponding to a path following the direction of the curve in relation to a 
fixed base point. Self-intersecting closed curves are closely related to Gauss 
words, knot diagrams, and their shadows.

Figure 1: Self-intersecting closed curve (left), chord diagram (center),
and linked diagram (right) representations of the double occurrence word 
121323. Base points, indicating the starting point for reading the word, are 
marked by || . 

Chord diagrams are defined in the following way. Start with a circle 
and place n distinctly labeled chords with distinct endpoints in any ar- 
rangement (possibly crossing) around the circle. Label the endpoints of 
each chord with the chord label. Fix a base point on the circle between any 
two chord endpoints on the circle. The resulting diagram is called a chord 
diagram. Each chord diagram has an associated double occurrence word 
formed by reading the labels of the endpoints, from the base point back to 
base point, clockwise around the circle. See [HIH] for more information on 
chord diagrams. 

A linked (or linearized chord [16]) diagram is a pairing of 2n distinct
ordered points. Graphically, the ordered points are positioned on a line 
and their pairing is illustrated by an arc connecting them. Such a diagram 
can be specified by listing the pairs defined by the n arcs. See [TfllTSlfTS] . 
A linked diagram can be obtained from a chord diagram by cutting the 
outer circle at the base point. Conversely, if we arrange the points of the 
link diagram in a circle and mark a base point between the first and last 
point, the corresponding representation is a chord diagram. 

Since double occurrence words naturally arise in a variety of contexts, 
insight into their combinatorial structure enriches several fields simultane- 
ously. In this paper, we explore several classifications of double occurrence 
words based on separating larger double occurrence words into smaller dou- 
ble occurrence words. Further, we count and enumerate members of these 
classes.

Some of these formulas have been derived in completely different
contexts using a variety of approaches. Moreover, none of the papers we came
across contains a compilation of the known formulas. In this paper
we give a unified approach to deriving these formulas and provide a
new formula, giving what appears to be a previously unlisted integer sequence.

We note that applications of double occurrence words extend to other
disciplines. In Section 2.2 we observe that certain double occurrence words are
related to particular Feynman diagrams in physics, and in Section 4 we
establish a connection between double occurrence words and DNA
recombination events.

2 Preliminaries 

2.1 Types of Equivalences 

For convenience, we let $\Sigma = \{1, 2, \ldots, n\}$ and relabel each double occurrence
word such that when i appears for the first time in the word, it is preceded
by $1, 2, \ldots, i - 1$. Double occurrence words labeled by this convention are
said to be in ascending order. Two double occurrence words are said to be 
equivalent if they are equal after being relabeled in ascending order. If two 
double occurrence words are not equivalent, they are said to be distinct. 
Throughout this paper, we shall assume that all double occurrence words 
are in ascending order unless stated otherwise. 

For example, 122313 is a double occurrence word in ascending order. Its 
reverse with the same letters is 313221, which is not in ascending order. By 
relabeling 313221 in ascending order we obtain 121332. In this example
122313 is distinct from its reverse 121332. However, it is easily checked that
123312 is equivalent to its reverse, which motivates the following
classification.

Definition 2.1 A double occurrence word is palindromic (or symmetric) if 
it is equivalent to its reverse. A double occurrence word that is palindromic 
is called a palindrome. 
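
The relabeling and the palindrome test can be sketched in a few lines of Python. This is only an illustration: it assumes words are given as tuples of integers, and the function names `ascending` and `is_palindromic` are hypothetical, not taken from the paper.

```python
def ascending(word):
    """Relabel a word so that its letters first appear in the order 1, 2, 3, ..."""
    labels = {}
    relabeled = []
    for letter in word:
        if letter not in labels:
            labels[letter] = len(labels) + 1  # next unused label
        relabeled.append(labels[letter])
    return tuple(relabeled)


def is_palindromic(word):
    """A word is palindromic if it is equivalent to its reverse."""
    return ascending(word) == ascending(tuple(reversed(word)))


w = (1, 2, 2, 3, 1, 3)
print(ascending(tuple(reversed(w))))        # (1, 2, 1, 3, 3, 2)
print(is_palindromic(w))                    # False
print(is_palindromic((1, 2, 3, 3, 1, 2)))   # True
```

The two printed checks reproduce the examples 122313 and 123312 discussed above.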

In all three interpretations of double occurrence words (topological,
graph theoretic, and linked diagrams), the reverse word induces a diagram
isomorphic to the original, with the orientation reversed. In the topological
sense, the orientation refers to the orientation of the closed curve. The
reverse of a linked diagram may be interpreted as reading the diagram
right-to-left rather than left-to-right. Finally, the reverse chord diagram
may be obtained by reading the letters around the circle in a counter-clockwise
fashion rather than clockwise.

If we wish to count the non-isomorphic diagrams generated from dou- 
ble occurrence words, we observe that each diagram can have exactly two 
orientations. Thus, no more than two distinct double occurrence words can 
correspond to the same diagram with regard to a starting base point. 

If a diagram corresponds to a palindrome, only one distinct double oc- 
currence word is associated with the diagram. Therefore we may count the 
number of non-isomorphic diagrams with regard to a base point as 

$$\text{Total Diagrams} = (\#\text{ of Palindromes}) + \tfrac{1}{2}\,(\#\text{ of Non-Palindromes}) = \frac{(\#\text{ of D.O. Words}) + (\#\text{ of Palindromes})}{2} \qquad (*)$$

We will make use of this formula extensively throughout Section 3 to count
the number of distinct diagrams corresponding to double occurrence words 
with each separation property. 
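
For small sizes, formula (*) can be checked by brute force. The sketch below is an illustration under stated assumptions: words are tuples of integers, and the helper names (`double_occurrence_words`, `count_diagrams`) are hypothetical. It enumerates every word of size n in ascending order, counts the palindromes among them, and applies (*).

```python
def ascending(word):
    """Relabel a word so that its letters first appear in the order 1, 2, 3, ..."""
    labels = {}
    relabeled = []
    for letter in word:
        if letter not in labels:
            labels[letter] = len(labels) + 1
        relabeled.append(labels[letter])
    return tuple(relabeled)


def double_occurrence_words(n):
    """Yield every double occurrence word of size n in ascending order."""
    def extend(prefix, next_new, open_letters):
        if len(prefix) == 2 * n:
            yield tuple(prefix)
            return
        if next_new <= n:
            # introduce the next unused letter
            yield from extend(prefix + [next_new], next_new + 1,
                              open_letters | {next_new})
        for x in sorted(open_letters):
            # place the second occurrence of a letter seen exactly once
            yield from extend(prefix + [x], next_new, open_letters - {x})
    yield from extend([], 1, frozenset())


def count_diagrams(n):
    """Return (# words, # palindromes, # diagrams by formula (*))."""
    words = list(double_occurrence_words(n))
    palindromes = [w for w in words if ascending(w[::-1]) == w]
    return len(words), len(palindromes), (len(words) + len(palindromes)) // 2


print(count_diagrams(3))  # (15, 7, 11)
```

The enumeration grows like (2n - 1)!!, so this check is only practical for small n.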

It should be noted that omitting the base point in the closed curve or 
chord diagram makes it possible for more than two double occurrence words 
to be associated with the same diagram. For instance, rotating the base 
point around the circle in Figure 1 would lead to 121323, 213231, and 132312,
which are 121323, 123132, and 123213 in ascending order, respectively. We
do not consider isomorphisms of this type in this paper. 

2.2 Types of Separations 

As mentioned in the introduction, double occurrence words regularly appear 
in various fields of mathematics. As a result, there are unfortunately several
different, and sometimes conflicting, definitions used to express identical 
properties. We shall make note of these discrepancies in notation as they 
come up. 

Jacques Touchard was one of the first researchers to comprehensively 
consider the counting of double occurrence words. In his paper [17], he
classified several types of linked diagrams and enumerated the number of
diagrams containing a fixed number of crossings. He introduced the classi- 
fication of "unique systems" and "proper unique systems" which coincide 
with the following two definitions for irreducible and strongly-irreducible 
words. 

Definition 2.2 If a double occurrence word w can be written as a product
w = uv of two non-empty double occurrence words u, v, then w is called
reducible; otherwise, it is called irreducible.

The number of irreducible double occurrence words has a close con- 
nection with the number of non-isomorphic unlabeled connected Feynman 
diagrams (also called irreducible Feynman diagrams (14j ) arising in a sim- 
plified model of quantum electrodynamics [U |^ . 

This definition for irreducibility agrees with [1] and [2] yet conflicts 
with [ini where "irreducible" is used for our notion of strongly-irreducible 
as defined below. 

Definition 2.3 A non-empty double occurrence word is strongly-irreducible 
if it does not contain a proper sub-word that is also a double occurrence 
word. 

The double occurrence word 12213434 is reducible because it can be 
written as the product of the two double occurrence words 1221 and 3434, 
but 12344123 is irreducible. However, since 44 is a proper sub-word of 
12344123 it is not strongly-irreducible. The word 12132434 is strongly- 
irreducible. By definition, strongly-irreducible words are also irreducible, 
so 12132434 is irreducible as well. In particular 11 is strongly-irreducible. 
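
Both definitions translate directly into tests on prefixes and contiguous sub-words. The following is a minimal sketch, assuming words are given as strings or tuples; the function names are illustrative only.

```python
def is_double_occurrence(word):
    """True if the word is non-empty and every letter occurs exactly twice."""
    return len(word) > 0 and all(word.count(x) == 2 for x in set(word))


def is_irreducible(word):
    """Irreducible: no proper non-empty prefix is a double occurrence word.

    If such a prefix u exists, the remaining suffix v is automatically a
    double occurrence word, so w = uv is reducible.
    """
    return not any(is_double_occurrence(word[:i])
                   for i in range(2, len(word), 2))


def is_strongly_irreducible(word):
    """Strongly-irreducible: no proper contiguous sub-word is a double occurrence word."""
    n = len(word)
    for i in range(n):
        for j in range(i + 2, n + 1, 2):
            if j - i < n and is_double_occurrence(word[i:j]):
                return False
    return True


print(is_irreducible("12213434"))           # False
print(is_irreducible("12344123"))           # True
print(is_strongly_irreducible("12344123"))  # False
print(is_strongly_irreducible("12132434"))  # True
```

The four printed values match the examples in the paragraph above.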

Strongly-irreducible double occurrence words are also called connected
words. This terminology is motivated by the circle graph associated
with a chord diagram. The circle graph is formed by representing the
chords as vertices and the crossings of those chords as edges in the graph.
In the topological convention, a circle graph is also called an interlinking
graph [3]. It is not difficult to prove that a double
occurrence word is strongly-irreducible if and only if the circle graph of the
corresponding chord diagram, or interlinking graph of the corresponding
closed curve, is connected.
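
The circle graph criterion can be illustrated as follows. This is a sketch, not the authors' construction: it assumes that two chords cross exactly when their endpoints alternate in the word, and the function names are hypothetical.

```python
def chords_cross(word, a, b):
    """Chords a and b cross iff their endpoints alternate in the word."""
    a1, a2 = [i for i, x in enumerate(word) if x == a]
    b1, b2 = [i for i, x in enumerate(word) if x == b]
    return (a1 < b1 < a2 < b2) or (b1 < a1 < b2 < a2)


def circle_graph_connected(word):
    """Depth-first search on the circle graph whose vertices are the letters."""
    letters = sorted(set(word))
    seen, stack = {letters[0]}, [letters[0]]
    while stack:
        a = stack.pop()
        for b in letters:
            if b not in seen and chords_cross(word, a, b):
                seen.add(b)
                stack.append(b)
    return len(seen) == len(letters)


print(circle_graph_connected("12132434"))  # True
print(circle_graph_connected("12344123"))  # False
```

In 12344123 the chord labeled 4 crosses no other chord, so it forms its own component, which agrees with the word not being strongly-irreducible.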

Lemma 2.4 Every double occurrence word contains a strongly-irreducible
sub-word. 

Proof. If a double occurrence word w is strongly-irreducible, then w itself
is a strongly-irreducible sub-word of w. A double occurrence word which is
not strongly-irreducible contains, by definition, a proper sub-word $w_1$ which
is a double occurrence word and is either strongly-irreducible or not. If
$w_1$ is not strongly-irreducible, we repeat the argument with a proper
sub-word $w_2$ of $w_1$. Since w has finite length, after finitely many steps
we must reach a double occurrence word $w_i$ which is a strongly-irreducible
proper sub-word of $w_{i-1}$. Since $w_i$ is also a proper sub-word of w, this
completes the proof. □

3 Counting 

It is well known [51 [51 [13 [TB] and straightforward to show that the total 
number of double occurrence words is $(2n - 1)!!$. Formula (*) motivates us
to enumerate the number of double occurrence words which correspond to 
palindromes. 

3.1 Palindromes 

Theorem 3.1 The number $L_n$ of palindromic double occurrence words of
length 2n is given by

$$L_n = \sum_{k=0}^{\lfloor n/2 \rfloor} \frac{n!}{(n-2k)!\,k!} \quad \text{for } n \ge 1.$$

Proof. Observe that $L_1 = 1$ since there is a unique one letter palindrome,
and $L_2 = 3$ because 1122, 1212, and 1221 are all the two letter palindromes.

If a double occurrence word w of size $n \ge 2$ is a palindrome beginning
and ending with 1, then the word formed by removing both 1s is also a
palindrome. Hence there are $L_{n-1}$ palindromes with n letters that start
and end with 1.

Now consider a word w of size $n \ge 3$ where the second symbol 1 is at
the position $j \ne 2n$. Note that there are $2n - 2$ possible positions for j.
Then the word w is a palindrome if and only if w contains the same symbol
s at the positions $2n$ and $2n - j + 1$. Removing symbols 1 and s from w, and
relabeling the resulting word accordingly, produces a palindrome of size
$n - 2$. Hence there are $L_{n-2}$ palindromes that have a symbol 1 at the jth
position for $2 \le j \le 2n - 1$.

According to the above argument,

$$L_n = L_{n-1} + (2n - 2)\,L_{n-2} \quad \text{for } n \ge 3, \qquad L_1 = 1 \text{ and } L_2 = 3$$

is a recurrence relation for $L_n$. It is known [11] that the closed formula for
this recurrence relation is as stated. □

This formula is expressed without proof in a comment by Ross Drewe in
A047974 of the OEIS [11] in 2008, but this may not be the original source.
Similar results, such as the number of palindromic chord diagrams without
a base point, were known in 2000 [16]. The proof reprinted here is
found in [2].
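
The recurrence for the palindrome counts is easy to tabulate. The sketch below also compares it against one closed form, the sum of n!/((n-2k)! k!) over k from 0 to floor(n/2). That particular form is an assumption of this example, chosen because it agrees with the recurrence and with sequence A047974; the function names are illustrative.

```python
from math import factorial


def palindromes_recurrence(n_max):
    """L_1, ..., L_{n_max} from L_n = L_{n-1} + (2n - 2) L_{n-2}."""
    L = {1: 1, 2: 3}
    for n in range(3, n_max + 1):
        L[n] = L[n - 1] + (2 * n - 2) * L[n - 2]
    return [L[n] for n in range(1, n_max + 1)]


def palindromes_closed_form(n):
    """Assumed closed form: sum over k of n! / ((n - 2k)! k!)."""
    return sum(factorial(n) // (factorial(n - 2 * k) * factorial(k))
               for k in range(n // 2 + 1))


print(palindromes_recurrence(8))
# [1, 3, 7, 25, 81, 331, 1303, 5937]
print([palindromes_closed_form(n) for n in range(1, 9)])
# [1, 3, 7, 25, 81, 331, 1303, 5937]
```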

3.2 Irreducibles 

Though Touchard introduced the classification of irreducible words in 1952, 
there seems to be little continuation of his efforts. In 2000, Martin and 
Kearney expressed the number of irreducible words in the broader con- 
text of solutions to generating functions. Here, we address the count and 
construction of both the irreducible double occurrence words and irre- 
ducible palindromes directly. 

Lemma 3.2 The number $I_n$ of irreducible double occurrence words with
length 2n satisfies the recurrence formula $I_1 = 1$ and

$$I_n = (2n - 1)!! - \sum_{k=1}^{n-1} I_{n-k}\,(2k - 1)!! \quad \text{for } n \ge 2.$$

Proof. We shall count the number of irreducible double occurrence words
by subtracting the number of reducible double occurrence words from the
total number of double occurrence words of length 2n, after showing that each
reducible word may be written uniquely as the product of an irreducible word
and a non-empty double occurrence word.

Let w = uv be a reducible double occurrence word of length 2n such that u is
an irreducible double occurrence word; taking u to be the shortest non-empty
prefix of w that is a double occurrence word always gives such a factorization.
Note that no non-empty proper prefix of an irreducible word is a
double occurrence word, so u is uniquely determined. If the length of v is 2k,
for some $1 \le k \le n - 1$, then the length of u is $2(n - k)$. By construction,
u is irreducible and is counted among $I_{n-k}$, and v is counted among the
$(2k - 1)!!$ possible double occurrence words of length 2k.

Summing over the possible sizes k of v yields the desired count. Since
u is irreducible and v is non-empty, each reducible double
occurrence word w is counted exactly once. □
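
As a numerical check, the recurrence of Lemma 3.2 can be evaluated directly. This is a minimal sketch with illustrative function names.

```python
def odd_double_factorial(k):
    """(2k - 1)!! = 1 * 3 * ... * (2k - 1), the number of words of size k."""
    result = 1
    for m in range(1, 2 * k, 2):
        result *= m
    return result


def irreducible_counts(n_max):
    """I_1, ..., I_{n_max} from I_n = (2n-1)!! - sum_k I_{n-k} (2k-1)!!."""
    I = {1: 1}
    for n in range(2, n_max + 1):
        reducible = sum(I[n - k] * odd_double_factorial(k)
                        for k in range(1, n))
        I[n] = odd_double_factorial(n) - reducible
    return [I[n] for n in range(1, n_max + 1)]


print(irreducible_counts(6))  # [1, 2, 10, 74, 706, 8162]
```

The printed values agree with the first terms of A000698.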

Theorem 3.3 The number $J_n$ of irreducible palindromes with length 2n
satisfies the recurrence formula $J_1 = 1$ and

$$J_n = L_n - \sum_{k=1}^{\lfloor n/2 \rfloor} (2k - 1)!!\, J_{n-2k} \quad \text{for } n \ge 2,$$

where $L_n$ is the total number of palindromes with length 2n and $J_0 = 1$
accounts for the empty word.

Proof. Similar to the above argument, we first count the reducible
palindromes and subtract them from the total number of palindromic words.

Suppose w is a reducible palindrome with length 2n. Then
w can be written as w = uvu' where u is an arbitrary double occurrence
word with length 2k ($1 \le k \le \lfloor n/2 \rfloor$), u' is the double occurrence word
corresponding to u by reversing the orientation, and v is a (possibly empty)
irreducible palindrome with length $2(n - 2k)$. There are $(2k - 1)!!$ choices
for u and $J_{n-2k}$ choices for v, and summing over k gives the stated count. □
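
The recurrence of Theorem 3.3 can be evaluated in the same way. The sketch below assumes the convention J_0 = 1, which is what the k = n/2 term needs when the middle word v is empty, and it computes L_n from the recurrence L_n = L_{n-1} + (2n - 2) L_{n-2}. Function names are illustrative.

```python
def odd_double_factorial(k):
    """(2k - 1)!! = 1 * 3 * ... * (2k - 1)."""
    result = 1
    for m in range(1, 2 * k, 2):
        result *= m
    return result


def irreducible_palindrome_counts(n_max):
    """J_1, ..., J_{n_max}, assuming the convention J_0 = 1 (empty word)."""
    L = {1: 1, 2: 3}
    for n in range(3, n_max + 1):
        L[n] = L[n - 1] + (2 * n - 2) * L[n - 2]
    J = {0: 1, 1: 1}
    for n in range(2, n_max + 1):
        reducible = sum(odd_double_factorial(k) * J[n - 2 * k]
                        for k in range(1, n // 2 + 1))
        J[n] = L[n] - reducible
    return [J[n] for n in range(1, n_max + 1)]


print(irreducible_palindrome_counts(7))  # [1, 2, 6, 20, 72, 290, 1198]
```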

Though the number of irreducible double occurrence words appears in
the OEIS (A000698), we note that the number of irreducible palindromes is
the only sequence discussed in this paper which is not currently listed in the
OEIS [11]. See Table 1, Table 2, and Table 3 for the number of irreducibles,
strong-irreducibles, and the number of non-isomorphic diagrams as defined
according to (*), respectively.

3.3 Strong-Irreducibles 

The classification of strongly-irreducible double occurrence words was
introduced in [18] and the first counting of the strong-irreducibles was done
by Stein in (TS]. Stein was the first to count both the strongly- irreducible 
double occurrence words and the strongly-irreducible palindromes, but his 
counting methods and recursive formulas were simplified in [10] and later 
by Klazar in [5]. In Theorem 3.5 we present a proof similar to [S] expressed
in terms of language theory. 

Using language theory to count double occurrence words led directly
to a characterization of the strongly-irreducible double occurrence words,
which we express in Lemma 3.4; Theorem 3.5 follows as a natural
consequence.

Lemma 3.4 Every strongly-irreducible double occurrence word w in
ascending order with at least two distinct letters may be written in a unique
form as $w = 1u_1v_11v_2u_2$ where $1u_11u_2$ and $v_1v_2$ are both
strongly-irreducible.

Proof. Let w be strongly-irreducible. Every double occurrence word w in
ascending order must be of the form $w = 1p_11p_2$. Delete both 1's. Then we
have a double occurrence word $p_1p_2 = u_1xu_2$ where x is the first
strongly-irreducible double occurrence sub-word of smallest positive length.
Thus $u_1$ and $u_2$ are uniquely defined. Note that $u_1$ and $u_2$ may be
empty words.

Let $v_1$ be the prefix of x which is a suffix of $p_1$ and let $v_2$ be the suffix
of x which is a prefix of $p_2$. This means that $x = v_1v_2$. Neither $v_1$
nor $v_2$ is empty, as that would imply that x is a sub-word of either $p_1$ or
$p_2$, which would make x a proper double occurrence sub-word of w. Since w is
taken to be strongly-irreducible, this cannot be.

We show that $1u_11u_2$ is strongly-irreducible. Suppose not. Then there
exists a non-empty double occurrence sub-word z in either $u_1$ or $u_2$, which
implies that w contains z and is not strongly-irreducible. This is a
contradiction. Hence $1u_11u_2$ and $v_1v_2$ are strongly-irreducible. □
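
The decomposition in Lemma 3.4 can be traced on concrete words. The sketch below is illustrative only: it assumes the input is a strongly-irreducible word in ascending order, given as a tuple of integers with at least two distinct letters, and the name `decompose` is hypothetical. It takes x to be the leftmost shortest double occurrence sub-word of $p_1p_2$.

```python
def is_double_occurrence(word):
    """True if the word is non-empty and every letter occurs exactly twice."""
    return len(word) > 0 and all(word.count(x) == 2 for x in set(word))


def decompose(word):
    """Split a strongly-irreducible word 1 p1 1 p2 into (u1, v1, v2, u2)."""
    second = word.index(1, 1)                 # position of the second 1
    p1, p2 = word[1:second], word[second + 1:]
    p = p1 + p2
    # x: the leftmost shortest non-empty double occurrence sub-word of p1 p2
    for length in range(2, len(p) + 1, 2):
        for start in range(len(p) - length + 1):
            if is_double_occurrence(p[start:start + length]):
                end = start + length
                u1, v1 = p1[:start], p1[start:]
                v2, u2 = p2[:end - len(p1)], p2[end - len(p1):]
                return u1, v1, v2, u2
    raise ValueError("word has fewer than two distinct letters")


print(decompose((1, 2, 3, 1, 3, 2)))
# ((2,), (3,), (3,), (2,))
```

For 123132 this gives $u_1 = 2$, $v_1 = 3$, $v_2 = 3$, $u_2 = 2$, so $1u_11u_2$ is 1212 and $v_1v_2$ is 33.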

Theorem 3.5 The number $S_n$ of strongly-irreducible double occurrence words
with length 2n satisfies the recurrence formula

$$S_n = (n - 1) \sum_{k=1}^{n-1} S_k S_{n-k},$$

where $S_1 = 1$ and $n \ge 2$.

Proof. Note that the only strongly-irreducible double occurrence word of
length 2 is 11, i.e., $S_1 = 1$.

Let u and v be strongly-irreducible double occurrence words such that
the length of v is 2k, the length of u is $2(n - k)$, $u = 1u_11u_2$, and $v = v_1v_2$.
Since the length of v is 2k, there are $2k - 1$ ways to write $v = v_1v_2$ with
$v_1$, $v_2$ not empty. By Lemma 3.4, each strongly-irreducible double occurrence
word w of length 2n can be uniquely represented as $w = 1u_1v_11v_2u_2$. Hence
there are $2k - 1$ possibilities for such w's to be formed from each u and v.

Since there are $S_{n-k}$ choices for u and $S_k$ choices for v, the total count
for $S_n$ when $n \ge 2$ is given by

$$S_n = \sum_{k=1}^{n-1} (2k - 1)\,S_k S_{n-k} = (n - 1) \sum_{k=1}^{n-1} S_k S_{n-k},$$

where the second equality follows by averaging the sum with its reindexing
$k \mapsto n - k$. □
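
Both forms of the sum in the proof can be evaluated side by side. The sketch below is illustrative, with a hypothetical function name; it computes $S_n$ from the symmetric form and checks at each step that the weighted form with coefficients 2k - 1 gives the same value.

```python
def strongly_irreducible_counts(n_max):
    """S_1, ..., S_{n_max} from S_n = (n - 1) * sum_k S_k S_{n-k}."""
    S = {1: 1}
    for n in range(2, n_max + 1):
        weighted = sum((2 * k - 1) * S[k] * S[n - k] for k in range(1, n))
        symmetric = (n - 1) * sum(S[k] * S[n - k] for k in range(1, n))
        # the two expressions agree by the symmetry k <-> n - k
        assert weighted == symmetric
        S[n] = symmetric
    return [S[n] for n in range(1, n_max + 1)]


print(strongly_irreducible_counts(6))  # [1, 1, 4, 27, 248, 2830]
```

The printed values agree with the first terms of A000699.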

For completeness, we state Klazar's counting formula of the strongly-irreducible
palindromes. See [8] for the proof.

Theorem 3.6 Let Sn and Tn be the number of strongly-irreducible double 
occurrence words and strongly-irreducible palindromes of length 2n, respec- 
tively. Then 

n-1 L"/2J 
Tn = T^Tn-^ + ^ (2n - 4z - l)S,Tn-2i 

i=l i=l 

for $n \ge 2$ where $T_0 = -1$ and $T_1 = 1$.



Theorem 3.5 and Theorem 3.6 correspond to the sequences A000699 
and A004300 listed in the OEIS. For the first few values of these sequences, 
see Table 1 and Table 2.



4 Connection with DNA recombination 

Several species of ciliates, such as Oxytricha and Stylonychia, undergo mas- 
sive genome rearrangement during sexual reproduction. These massively 
occurring recombination processes make them ideal model organisms to 
study gene rearrangements. See [5] and references therein for details of the 
descriptions below. 

There are two types of nuclei, a micronucleus and a macronucleus, in 
these species. Micronuclear genes contain both coding and non-coding seg- 
ments which are reassembled to macronuclear genes during sexual repro- 
duction. The coding segments, called macronuclear destined sequences or 
MDSs, are part of the final unscrambled gene. The individual MDSs within
a micronuclear gene may be separated by non-coding segments, called
internal eliminated sequences or IESs, which are excised during the
recombination process.

Figure 2: Scrambled Actin I micronuclear gene in Oxytricha nova [13].

Figure 3: Unscrambled Actin I macronuclear gene in Oxytricha nova [13].

In relation to an unscrambled macronuclear gene (Fig. 3), a scrambled
micronuclear gene (Fig. 2) may have permuted or inverted MDS segments
separated by IESs. Formation of the macronuclear genes in these ciliates
thus requires any combination of the following three events: unscrambling
of segment order, DNA inversion, and IES removal.

Since the IESs are removed in the unscrambled gene, it is only necessary
to record the order and direction of the MDSs in the scrambled gene. A 
micronuclear arrangement (cf. [S]) is a sequence of permuted and inverted 
MDSs. In particular, each micronuclear arrangement $\alpha$ with k MDSs has
a corresponding permutation $\sigma_\alpha : [k] \to [k]$ and a signing function
$[k] \to \{-1, +1\}$ which uniquely define the arrangement. A sign of $-1$
indicates that an MDS is inverted with respect to the gene sequence in the
macronuclear gene while a sign of $+1$ indicates a regular orientation.

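For example, the micronuclear arrangement of the Actin I gene in Figure 2 is

$$M_3^{+1} M_4^{+1} M_6^{+1} M_5^{+1} M_7^{+1} M_9^{+1} M_2^{-1} M_1^{+1} M_8^{+1}$$

or more commonly denoted

$$M_3\, M_4\, M_6\, M_5\, M_7\, M_9\, \overline{M}_2\, M_1\, M_8$$

where $\overline{M}_2$ indicates that MDS2 is inverted in the scrambled micronuclear
gene.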

Proposition 4.1 Let $A_n$ be the number of micronuclear arrangements of
n MDSs. Then

$$A_n = 2^n\, n! = (2n)!!.$$

Proof. Each micronuclear arrangement $\alpha$ with n MDSs is uniquely defined
by its corresponding permutation $\sigma_\alpha$ and signing function. Since each
MDS may be signed in one of two ways, there are $2^n$ ways to sign each of the
$n!$ permutations of the n MDSs. □

The exact process by which the scrambled micronuclear gene recom- 
bines into an unscrambled macronuclear gene is unknown. However it is 
theorized [12] that short sequences of nucleotides, called pointers, found at 
the beginning and end of each MDS, guide the recombination process. In 
fact, each MDS is characterized by its pointers in the following sense. 

Each MDS is labeled according to its order in the unscrambled macronu- 
clear gene. The pointers flanking the MDSs correspond to the order of the 
MDSs such that the pointer sequence at the end of the ith MDS coincides 
with the pointer sequence at the beginning of the (i + 1)th MDS. IESs are
excised and their coding is not necessary. Since the pointers at the begin- 
ning and end of the whole gene do not align with any other pointers, we 
omit them. Mathematically, this translates to the following. 

Let $\mathcal{A}_n$ be the set of all micronuclear arrangements with n MDSs and
$X_n$ be the set of all double occurrence words with length 2n. Then
$g : \mathcal{A}_n \to X_{n-1}$ is a homomorphism which translates a micronuclear
arrangement to the ordered sequence of pointers which describes it, i.e.,

1. $g(M_1^{-1}) \mapsto (1)$ and $g(M_1^{+1}) \mapsto (1)$

2. $g(M_i^{-1}) \mapsto (i)(i-1)$ for $1 < i < n$

3. $g(M_i^{+1}) \mapsto (i-1)(i)$ for $1 < i < n$

4. $g(M_n^{-1}) \mapsto (n-1)$ and $g(M_n^{+1}) \mapsto (n-1)$.

For the micronuclear arrangement $\alpha = M_2^{-1} M_4^{+1} M_1^{+1} M_5^{+1} M_3^{+1}$,

$$g(\alpha) = (2)(1)(3)(4)(1)(4)(2)(3)$$

which corresponds to the double occurrence word 12342413 in ascending 
order. Therefore each scrambled micronuclear gene corresponds to a mi- 
cronuclear arrangement which, in turn, has an associated double occurrence 
word. 

A double occurrence word is called realizable if it has a corresponding 
micronuclear arrangement. The shortest double occurrence word which is 
not realizable is 11233244. For further information on realizable double
occurrence words see [2].
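
The map g and the notion of realizability can be explored by brute force for small sizes. The sketch below rests on assumptions: an arrangement is encoded as a list of (index, sign) pairs, a word of size n is called realizable when it is the image of some arrangement of n + 1 MDSs, and all function names are hypothetical.

```python
from itertools import permutations, product


def ascending(word):
    """Relabel a word so that its letters first appear in the order 1, 2, 3, ..."""
    labels = {}
    relabeled = []
    for letter in word:
        if letter not in labels:
            labels[letter] = len(labels) + 1
        relabeled.append(labels[letter])
    return tuple(relabeled)


def pointer_word(arrangement):
    """Apply g to a list of (i, sign) pairs, with i in 1..n and sign in {-1, +1}."""
    n = len(arrangement)
    pointers = []
    for i, sign in arrangement:
        # MDS i is flanked by pointers i-1 and i; the outer pointers 0 and n are omitted
        block = [p for p in (i - 1, i) if 1 <= p <= n - 1]
        if sign == -1:
            block.reverse()
        pointers.extend(block)
    return tuple(pointers)


def realizable_words(size):
    """Ascending words of the given size that arise from size + 1 MDSs."""
    n = size + 1
    words = set()
    for order in permutations(range(1, n + 1)):
        for signs in product((-1, +1), repeat=n):
            words.add(ascending(pointer_word(list(zip(order, signs)))))
    return words


alpha = [(2, -1), (4, +1), (1, +1), (5, +1), (3, +1)]
print(pointer_word(alpha))             # (2, 1, 3, 4, 1, 4, 2, 3)
print(ascending(pointer_word(alpha)))  # (1, 2, 3, 4, 2, 4, 1, 3)
print((1, 1, 2, 3, 3, 2, 4, 4) in realizable_words(4))  # False
```

The search over 5 MDSs visits 2^5 * 5! = 3840 arrangements, in line with Proposition 4.1, and confirms that 11233244 has no preimage under this encoding.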

5 Conclusions 

Double occurrence words are studied in topology, graph theory, and
combinatorics by way of self-intersecting closed curves in $\mathbb{R}^3$, chord graphs and
linked diagrams, respectively. Their applications extend beyond abstraction
to other disciplines such as physics and genetics. We considered the count- 
ing and enumeration of several reducibility classes of double occurrence 
words which directly led to a new characterization of strongly-irreducible 
double occurrence words. Further, all but one of the enumerated sequences
are listed in the OEIS [11], which suggests both the relevance of the
previously listed enumerations and the novelty of the unlisted irreducible
palindrome count. It should be noted that all the counting arguments presented
in this paper followed a similar theme: separate the classes of double oc- 
currence words into palindromes and non-palindromes and describe the 
construction of large double occurrence words from smaller double occur- 
rence words. We believe that the counting techniques presented here could 
be used to enumerate new classes of double occurrence words as they arise 
in future research. 

Symbols  All            Irreducible    Strongly Irreducible
1        1              1              1
2        3              2              1
3        15             10             4
4        105            74             27
5        945            706            248
6        10395          8162           2830
7        135135         110410         38232
8        2027025        1708394        593859
9        34459425       29752066       10401712
10       654729075      576037442      202601898
11       13749310575    12277827850    4342263000
12       316234143225   285764591114   101551822350
OEIS     A001147 (K_n)  A000698 (I_n)  A000699 (S_n)

Table 1: All Double Occurrence Words.



Symbols  All            Irreducible    Strongly Irreducible
1        1              1              1
2        3              2              1
3        7              6              2
4        25             20             7
5        81             72             22
6        331            290            96
7        1303           1198           380
8        5937           5452           1853
9        26785          25176          8510
10       133651         125874         44940
11       669351         637926         229836
12       3609673        3448708        1296410
OEIS     A047974 (L_n)  (J_n)          A004300 (T_n)

Table 2: Palindromic Double Occurrence Words.

Symbols  All            Irreducible    Strongly Irreducible
1        1              1              1
2        3              2              1
3        11             8              3
4        65             47             17
5        513            389            135
6        5363           4226           1463
OEIS     A001147 (K_n)  A000698 (I_n)  A000699 (S_n)
         A047974 (L_n)  (J_n)          A004300 (T_n)

Table 3: Non-isomorphic diagrams in (*) are obtained by summing all words
with the palindromes of each class and halving the total. These sequences
do not appear in the OEIS [11], but can be built from listed sequences.